Project Report¶

Navigating Germany's Rental Landscape for International Students¶

Summary¶

The project aims to analyze data concerning international students and rental properties in bayren, Germany. By examining this information, the project seeks to provide valuable insights to newcomers, helping them identify cities in bayren, Germany with the most favorable accommodation options and thus facilitating their transition to the country.

Rationale¶

Through an examination of the trends in the influx of foreign students across various cities in Bavaria, Germany, and an assessment of the corresponding accommodation options in these cities, this analysis aims to offer valuable recommendations. These insights can be instrumental for prospective students, landlords, and local authorities in identifying cities with higher probabilities of finding suitable accommodation, thereby facilitating informed decision-making in the pursuit of education in Bavaria.

Project Flow¶

  1. Data Acquisition:
  • Obtain information about rental property listings from Immoscout24 from the provided metadata URL and download the corresponding csv file.
  • Obtain the Foreign students dataset from the provided metadata URL and download the corresponding xlsx file.

Datasource1: Foreign students in Germany

  • Metadata URL: https://data.europa.eu/data/datasets/https-www-datenportal-bmbf-de-portal-2-5-34
  • Data URL: https://www.datenportal.bmbf.de/portal/Tabelle-2.5.34.xls
  • Data Type: xlsx

This dataset comprises information about international students admitted to degree programs in bayren Germany over a specific time period. It includes data on the students' nationalities, degree programs, admission dates, and other relevant attributes.

Datasource2: Immoscout24 dataset

  • Metadata URL: https://www.kaggle.com/datasets/corrieaar/apartment-rental-offers-in-germany
  • Data URL: https://www.kaggle.com/datasets/corrieaar/apartment-rental-offers-in-germany/download?datasetVersionNumber=6
  • Data Type: xlsx

This dataset comprises information about rental property listings from Immoscout24, which is the largest real estate platform in bayren Germany. It includes details about rental prices, property characteristics, and the property's location.

Used Libraries¶

Import all necessary Libraries

In [12]:
# Import all necessary libraries
import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
import pandas as pd
import folium
import matplotlib.pyplot as plt
from IPython.display import display
In [42]:
# Display muninch_map
image = Image.open('output/project-flow.png')
display(image)

Data Extraction¶

Immoscout dataset was sourced from Kaggle, by enabling Kaggle API, from kaggle "API" section and click on "Create New API Token". This will download a file named kaggle.json to your computer then Store Kaggle API Key in your system and the second dataset of Forigen students in Bavaria Germany was sourced from europa.eu.

#Data Extraction
def fetch_kaggle_dataset(dataset, target_folder, filename):
    api = KaggleApi()
    api.authenticate()
    api.authenticate()
    username, dataset_name = dataset.split('/')[-2:]
    zip_file_path = os.path.join(target_folder,f"{dataset_name}.zip")
    api.dataset_download_files(f"{username}/{dataset_name}", path=target_folder, unzip=False)
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extract(filename, path=target_folder)

def data_extraction_csv(path):
    t1 = time()
    print("Data Extraction in progress...")

    try:
        df = pd.read_csv(path, encoding='latin-1')
    except Exception as e:
        print("Error occurred during file reading:", str(e))
        return None
    t2 = time()
    print("Finish: Data Extraction {} s ".format(t2 - t1))
    return df

    def data_extraction_xls(path):
    t1 = time()
    print("Data Extraction in progress...")

    try:
        df = pd.read_excel(path, engine='openpyxl')
    except Exception as e:
        print("Error occurred during file reading:", str(e))
        return None
    t2 = time()
    print("Finish: Data Extraction {} s ".format(t2 - t1))
    return df

Data Cleaning and Preprocessing:¶

  1. Load and explore the Immoscout24 and Forigen Students in Germany dataset.
  2. Clean the datasets by removing irrelevant columns and handling missing values.
  3. Perform any necessary data transformations or feature engineering.

Data Transformation

# Check if column is empty and drop corresponding rows
    df1.dropna(subset=['regio1', 'regio2', 'noRooms'], inplace=True)

    # Fetch only Bayren state dataset
    df1 = df1[df1['regio1'] == 'Bayern']

    # correct the formats of data values
    df1['regio2'] = df1['regio2'].str.encode('latin-1').str.decode('utf-8')

    # only fetch data with city name available in students data
    university_cities = df2['City'].unique()
    df1 = df1[df1['regio2'].isin(university_cities)]

    # Format dataset
    df1 = data_transformation(df1, df1_rename_cols, df1_drop_cols)
    data_loader("../dataset.sqlite", df1, "immoscout")

     # correct date format in year cloumn
    df2['Jahr'] = pd.to_datetime(df2['Jahr'].str.split('/').str[0], format='%Y')
    df2['Jahr'] = df2['Jahr'].dt.strftime('%Y')

    #drop rows with num values
    df2 = df2.dropna()

    # List of columns to convert from float64 to int
    columns = ['Geisteswissenschaften', 'Sozialwissenschaften', 'Mathematik',
                 'Ingenieurwissenschaften', 'Informatik',
                 'medizin', 'Landwirtschaft']

    # Convert selected columns to int
    df2[columns] = df2[columns].astype(int)

Rename Columns

df1_rename_cols = {
        "regio1": "federalState",
        "geo_plz": "zipCode",
        "regio2": "City",
        "regio3": "Town"
    }

    df2_rename_cols = {
            "Jahr": "Year",
            "Stadt": "City",
            "Universität": "University",
            "Geisteswissenschaften": "Humanities",
            "Sozialwissenschaften": "Social sciences",
            "Mathematik": "Mathematics",
            "Ingenieurwissenschaften": "Engineering sciences",
            "Informatik": "Computer Science",
            "medizin": "Medicine",
            "Landwirtschaft": "Agriculture"
        }

Drop Columns

df1_drop_cols = ["picturecount", "scoutId", "geo_bln", "geo_krs", "telekomHybridUploadSpeed", "telekomTvOffer", "newlyConst", "balcony", "picturecount", "pricetrend", "telekomUploadSpeed", "scoutId", "firingTypes", "yearConstructedRange", "interiorQual", "petsAllowed", "streetPlain", "lift", "baseRentRange", "typeOfFlat", "energyEfficiencyClass", "lastRefurbish", "electricityBasePrice", "electricityKwhPrice", "date"]

Extract Latitude and Longitude for Immoscout24 Dataset¶

# Query the database and load all data into a DataFrame
immoscout_table = "SELECT * FROM immoscout"
immoscout_table = pd.read_sql_query(immoscout_table, conn)

# Create a geocoder instance
geolocator = Nominatim(user_agent="my_app")

# Initialize lists to store latitude and longitude values
latitudes = []
longitudes = []

# Iterate over each row in the DataFrame
for index, row in immoscout_table.iterrows():
    # Check if the city is Munich or Nuremberg
    if row['City'] in ['München', 'Nürnberg']:
        address = f"{row['houseNumber']} {row['street']}, {row['Town']}, {row['City']}, {row['zipCode']} , {row['federalState']}"
        try:
            location = geolocator.geocode(address, timeout=10)
            if location is not None:
                latitudes.append(location.latitude)
                longitudes.append(location.longitude)
                print(address, ":", location.latitude, location.longitude)
            else:
                latitudes.append(None)
                longitudes.append(None)
        except GeocoderTimedOut:
            print(f"Geocoding timed out for address: {address}")
    else:
        # For other cities, use the existing latitude and longitude values
        latitudes.append(None)
        longitudes.append(None)

# Add latitude and longitude columns to the DataFrame
immoscout_table['latitude'] = latitudes
immoscout_table['longitude'] = longitudes

# Save the updated DataFrame back to the database
immoscout_table.to_sql('immoscout', conn, if_exists='replace', index=False)

Database Structure¶

In presenting the initial facets of our dataset, I have included the display of the dataframe head. This snapshot offers a brief yet insightful overview of the dataset's architecture, showcasing variable names, data types, and sample values. This intentional presentation serves as a precursor to more in-depth analyses and discussions, enabling a swift understanding of the dataset's composition before further exploration.

  1. Immoscout24 Database:
In [43]:
# Connect to the database
conn = sqlite3.connect('../dataset.sqlite')  

# Query the database and load data into a DataFrames
immoscout_table = f"SELECT * FROM immoscout"
immoscout_df = pd.read_sql_query(immoscout_table, conn)
immoscout_df.head()
Out[43]:
federalState serviceCharge heatingType totalRent yearConstructed noParkSpaces hasKitchen cellar baseRent houseNumber ... noRoomsRange garden livingSpaceRange City Town description facilities heatingCosts latitude longitude
0 Bayern 180.0 central_heating 2030.0 1871.0 0.0 1 1 1850.0 0 ... 2 0 4 München Altstadt Die Wohnung (OHNE Balkon !!) liegt im 4. OG (O... - Eichenparkettboden im Wohn-EZ u. im Schlafzi... 0.0 NaN NaN
1 Bayern 0.0 0 2400.0 1956.0 1.0 0 0 2400.0 3 ... 3 0 3 München Schwabing_West Highlights:\n\n * alle Wände der Wohnräume, ... Waschmaschine in der Wohnung\nFlachbildfernseh... 0.0 NaN NaN
2 Bayern 170.0 floor_heating 970.0 1974.0 1.0 1 1 800.0 0 ... 2 0 5 Passau Haidenhof_Süd WOHNEN UND ENTSPANNEN!\n \nDie Architektur be... - Gartenwohnung in bester Lage\n- 2 Zimmer + H... 0.0 NaN NaN
3 Bayern 0.0 0 2780.0 2016.0 0.0 0 0 2780.0 51 ... 3 0 4 München Milbertshofen Bitte beachten Sie: Die Wohnung ist komplett m... Waschmaschine in der Wohnung\nFlachbildfernseh... 0.0 48.18808 11.54337
4 Bayern 200.0 gas_heating 0.0 0.0 0.0 1 0 900.0 0 ... 4 0 4 Würzburg Lengfeld Helle 4,5-Zimmerwohnung in ruhigem, kleinen Wo... 0 0.0 NaN NaN

5 rows × 28 columns

  1. Forigen Student Dataset
In [44]:
# Connect to the database
conn = sqlite3.connect('../dataset.sqlite')  

# Query the database and load forigen students data into a DataFrames
intstudents_table = f"SELECT * FROM intstudents"
intstudents_df = pd.read_sql_query(intstudents_table, conn)
intstudents_df.head()
Out[44]:
Region University City Year Humanities Social sciences Mathematics Engineering sciences Computer Science Medicine Agriculture
0 bayren Ostbayerische Technische Hochschule (OTH) Rege... Regensburg 1998 30 16 58 4 74 18 16
1 bayren Ostbayerische Technische Hochschule (OTH) Rege... Regensburg 1999 20 20 58 0 36 20 20
2 bayren Ostbayerische Technische Hochschule (OTH) Rege... Regensburg 2000 27 18 58 2 31 18 18
3 bayren Ostbayerische Technische Hochschule (OTH) Rege... Regensburg 2001 27 18 58 4 36 18 18
4 bayren Ostbayerische Technische Hochschule (OTH) Rege... Regensburg 2002 42 18 58 4 42 18 18

Expolatory Data Analysis¶

  1. Perform correlation analysis between the proximity of rental property availability in Bavaria and the number of foreign students in Bavaria, Germany.
  2. Visualize the correlation using appropriate charts or plots.

Bar Plot¶

A bar plot displaying the distribution of the number of students according to the cities of Bavaria.

In [45]:
# List of columns to convert from float64 to int
columns = ['Humanities', 'Social sciences', 'Mathematics',
                 'Engineering sciences', 'Computer Science',
                 'Medicine', 'Agriculture']

# Calculate total students and add a new column
intstudents_df['Total-Student'] = intstudents_df[columns].sum(axis=1)

columns_to_use = ['City', 'Total-Student']
df_bar = intstudents_df[columns_to_use].groupby('City').sum()

# Sort the DataFrame by Total-Student in ascending order
df_bar = df_bar.sort_values(by='Total-Student', ascending=False)

# Set a seaborn color palette for different cities
city_palette = sns.color_palette("husl", n_colors=len(df_bar))

# Plot the compound bar graph using seaborn
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=df_bar.index, y='Total-Student', data=df_bar, palette=city_palette, hue=df_bar.index)

ax.set_xlabel('Cities')
ax.set_ylabel('No of Students')
ax.set_title('Distribution of Forigen Students in Cities of Bavaria')

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')

# Remove the legend, as it's not needed in this case
ax.legend().set_visible(False)

plt.show()
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

Bar Plot Insight¶

  • Munich has the largest number of students in Bavaria, with over 80,000. This is followed by Nuremberg, with over 60,000 students.
  • The number of students decreases as you move further away from Munich. This is to be expected, as larger cities tend to have more universities and other educational institutions.
  • There is a wide range in the number of students in different cities of Bavaria. For example, Würzburg has over 45,000 students, while Amberg has less than 20,000 students.
  • The number of students is not evenly distributed across Bavaria. Some cities, such as Munich and Nuremberg, have a large number of students, while other cities, such as Amberg and Coburg, have a smaller number of students.

Overall, the bar graph provides a useful overview of the distribution of students in Bavaria. It shows that the number of students is highest in the larger cities, and that there is a wide range in the number of students across different cities.

Trendline Plot¶

A trendline plot displaying the trend of foreign students studying in different cities of Bavaria over the last 20 years.

In [46]:
columns_to_use = ['City', 'Year', 'Total-Student']
df_trendline = intstudents_df[columns_to_use].copy()
df_trendline = df_trendline[df_trendline['Year'] > '1998']
# Group by 'City' and 'Year', and sum the 'Total-Student' values within each group
df_trendline_grouped = df_trendline.groupby(['City', 'Year'])['Total-Student'].sum().reset_index()

# Sort the DataFrame by 'Year'
df_trendline_grouped = df_trendline_grouped.sort_values(by='Year')

sns.set(style="darkgrid")

# Create a trendline plot
plt.figure(figsize=(12, 8))
ax = sns.lineplot(x='Year', y='Total-Student', hue='City', data=df_trendline_grouped, errorbar=None, marker='o', markersize=8)

plt.title('Trendline Plot of Forigen Students Over 20 Years')
plt.xlabel('Year')
plt.ylabel('Total Students')
plt.legend(title='City', loc='upper left', bbox_to_anchor=(1, 1))

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')

plt.show()

Trenline Plot Insight¶

  • The number of foreign students studying in Bavaria has increased by over 75% since 1999.
  • Munich has seen the largest increase in the number of foreign students, with a 90% increase since 1999.
  • Nuremberg has seen the second-largest increase in the number of foreign students, with a 75% increase since 1999.
  • Augsburg and Regensburg have also seen significant increases in the number of foreign students, with increases of 60% and 45%, respectively, since 1999.
  • Amberg and Coburg have seen relatively flat growth in the number of foreign students, with increases of 15% and 5%, respectively, since 1999.

Overall, the trendline plot suggests that Bavaria is becoming an increasingly popular destination for international students. This is likely due to a number of factors, including the high quality of education, the affordable cost of living, and the strong economy of the state.

Pie Plot¶

A pie plot displaying the cities that contain a higher percentage of available apartments in different cities of Bavaria.

In [47]:
apartments_wrt_town = immoscout_df['City'].value_counts()

# Sort the values for better visualization
apartments_wrt_town_sorted = apartments_wrt_town.sort_values(ascending=False)

# Set a seaborn color palette for different cities
town_palette = sns.color_palette("tab10", n_colors=len(apartments_wrt_town_sorted))

# Plot the pie chart using seaborn
plt.figure(figsize=(8, 8))
plt.pie(apartments_wrt_town_sorted.values, labels=apartments_wrt_town_sorted.index, autopct='%1.1f%%', colors=town_palette)
plt.title('Percentage of apartments based on cities in Bavaria')
plt.axis('equal') 
plt.show()

Pie Plot Insight¶

  • Munich has the highest percentage of available apartments in Bavaria, with 41.9%. This is likely due to the fact that Munich is the largest city in Bavaria and has a growing economy.
  • Nürnberg is the second city with the highest percentage of available apartments, with 17.1%. This is also likely due to the fact that Nuremberg is a large city with a growing economy.
  • The cities with the lowest percentage of available apartments are Ansbach, Amberg, Rosenheim, and Coburg. These cities are all located in rural areas of Bavaria.
  • The percentage of available apartments varies significantly between different cities in Bavaria. This suggests that there is a shortage of affordable housing in some parts of the state.

Overall, the pie chart provides a useful overview of the availability of apartments in Bavaria. It shows that there is a shortage of affordable housing in some parts of the state, and that this shortage is particularly acute in rural areas.

Bubble Plot¶

A bubble plot visualizing the locations of apartments available in Munich and Nuremberg based on longitude and latitude values calculated using the apartment addresses.

  • First extracting Munich and Nurnberg apartments data from the dataset.
In [48]:
# Extract data for Munich and Nuremberg
munich_data = immoscout_df[immoscout_df['City'] == 'München'][['latitude', 'longitude']].dropna().values.tolist()
nuremberg_data = immoscout_df[immoscout_df['City'] == 'Nürnberg'][['latitude', 'longitude']].dropna().values.tolist()
  • Bubble Plot for Munich:
In [49]:
# Create a Folium map
map_center = [48.1351, 11.5820]  # Center the map around Munich
munich_map = folium.Map(location=map_center, zoom_start=11)

# Add Circle markers to the map
for location in munich_data:
    folium.Circle(location=location, radius=200, color='blue', fill=True).add_to(munich_map)

# Display muninch_map
image = Image.open('output/munich_map.PNG')
display(image)
  • Bubble Plot for Nurnberg:
In [50]:
# Create a Folium map
map_center = [49.4521, 11.0767]  # Center the map around Munich
nurnberg_map = folium.Map(location=map_center, zoom_start=11)

# Add Circle markers to the map
for location in nuremberg_data:
    folium.Circle(location=location, radius=200, color='blue', fill=True).add_to(nurnberg_map)

# Display nurnberg_map
image = Image.open('output/nurnberg_map.PNG')
display(image)

Bubble Plot Insight¶

  • The concentration of apartments is higher in the city centers of Munich and Nuremberg. This is to be expected, as these are the most densely populated areas and also the most central locations for employment and amenities.
  • There are also some apartment clusters in the suburbs of both cities. This suggests that there is a demand for apartments in these areas, as they offer a more affordable option than living in the city center.

Overall, the bubble plots provide a useful overview of the distribution of apartments in Munich and Nuremberg. They can be used to identify areas with a high concentration of apartments, to compare the price of apartments in different areas, and to understand the factors that influence apartment prices.

Comparision Analysis¶

Filled-In Bar Plot¶

The filled bar graph illustrates the number of foreign students and available accommodations in Bavaria for the year 2022. The data is segmented by cities, providing insights into the accommodation availability for international students in different regions.

  • Calculate the number of available rooms and group by cities.
In [51]:
# Filter the DataFrame to include only rows where 'noRooms' is an integer
filtered_df = immoscout_df[immoscout_df['noRooms'].astype(int) == immoscout_df['noRooms']]

# Group by Cities and calculate the total number of rooms in each city
total_rooms_per_city = filtered_df.groupby('City')['noRooms'].sum().reset_index()
  • Calculate the number of students in each city in 2022.
In [52]:
columns_to_use = ['City', 'Total-Student']
# Fetch student data for year 2022
intstudents_df = intstudents_df[intstudents_df['Year'] == '2022']

#groupby data according to Cities
df_bar = intstudents_df[columns_to_use].groupby('City').sum()
df_bar.reset_index(inplace=True)
df_bar.rename(columns={'City': 'City'}, inplace=True)
  • Calculate the percentage of occupied and available rooms in cities for the year 2022.
In [53]:
# Claulate percatage of number of room occupied and available in cities in 2022
df_bar['Total-Student-percentage'] = (df_bar['Total-Student']/total_rooms_per_city['noRooms'])*100
In [54]:
# Set the Seaborn style to 'whitegrid' for a light background
sns.set_style("whitegrid")

# Create a bar plot with visually attractive color scheme
plt.figure(figsize=(10, 6), facecolor='None')  

# Define visually attractive colors
total_students_color = sns.color_palette("Blues")[4] 
no_rooms_color = sns.color_palette("Oranges")[4]  

ax1 = sns.barplot(x='City', y='noRooms', data=total_rooms_per_city, color=no_rooms_color, edgecolor='black', label='No Rooms')
ax2 = sns.barplot(x='City', y='Total-Student', data=df_bar, color=total_students_color, edgecolor='black', label='Total Students')

# Annotate each bar with values from the 'annotation_column'
for p, annotation_value in zip(ax1.patches, df_bar['Total-Student-percentage']):
    ax1.annotate(f'{annotation_value:.0f}%', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Set labels and title
plt.xlabel('City')
plt.ylabel('Student Count')
plt.title('Total Students and No Rooms by Cities')

# Rotate x-axis labels for better visibility
plt.xticks(rotation=45, ha='right')
# Display the legend
plt.legend()
plt.ylim(0,11000)
# Show the plot
plt.show()

Filled-In Bar Plot Insight¶

  • The number of students is significantly higher than the number of no rooms in all districts. This suggests that students are more likely to stay in dormitories than find other accommodation.
  • The district of München has the most students, with over 10,000. This is followed by Nürnberg and Augsburg.
  • The district of Amberg has the fewest students, with just over 1,000. This is followed by Coburg and Bayreuth.
  • The percentage of students with no rooms is highest in the district of Würzburg, at 94%. This is followed by Ansbach and Erlangen.
  • The percentage of students with no rooms is lowest in the district of Passau, at 42%. This is followed by Ingolstadt and Regensburg.

Overall, the graph shows that there is a high demand for student accommodation in Bavaria. Students are more likely to stay in dormitories than find other accommodation, and the districts with the most students also have the highest percentage of students with no rooms.

Conclusion¶

Upon comprehensive analysis of the dataset concerning foreign students and accommodation options in Bavaria, several key insights emerge. Munich stands out as the city with the highest number of students, reflecting its prominence as a major educational hub. However, when considering accommodation availability and affordability, other cities in Bavaria present noteworthy options for students.

Nuremberg and Augsburg, with substantial student populations and a diverse range of accommodation, emerge as promising choices for students seeking housing. These cities not only offer ample accommodation options but also showcase a growing economy, contributing to the overall attractiveness for students.

Furthermore, Passau, despite having a smaller number of students, stands out for its relatively lower percentage of students without accommodation (42%). This suggests that Passau provides better housing prospects for its student population compared to other districts.

In contrast, districts like Würzburg, Ansbach, and Erlangen demonstrate higher percentages of students without rooms, indicating potential challenges in securing suitable housing. While Würzburg records the highest percentage of students without rooms (94%), signaling a pronounced need for enhanced accommodation solutions in this district.

The pie chart highlighting the availability of apartments in Bavaria emphasizes the shortage of affordable housing in certain regions, particularly in rural areas such as Amberg, Rosenheim, and Coburg. Cities like Munich and Nuremberg, with higher percentages of available apartments, offer more promising prospects for students.

In summary, while Munich remains a significant academic and economic center, Nuremberg, Augsburg, and Passau emerge as cities with favorable accommodation options for students. As the demand for student housing continues to rise, these insights can guide students, landlords, and policymakers in making informed decisions regarding accommodation in Bavaria.